Coresets for Archetypal Analysis
Archetypal analysis represents instances as linear mixtures of prototypes (the archetypes) that lie on the boundary of the convex hull of the data. Archetypes are thus often more interpretable than factors computed by other matrix factorization techniques. However, the interpretability comes with high computational cost due to additional convexity-preserving constraints. In this paper, we propose efficient coresets for archetypal analysis. Theoretical guarantees are derived by showing that the quantization error of k-means upper bounds the archetypal analysis objective; the computation of a provable absolute-coreset can be performed in only two passes over the data. Empirically, we show that the coresets lead to improved performance on several data sets.
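The abstract does not spell out the coreset construction; as an illustrative sketch (assuming a lightweight-coreset style sampling distribution that mixes a uniform term with squared distance to the data mean, with inverse-probability weights — the function name and the 50/50 mixing are our assumptions, not the paper's exact scheme), a two-pass sampler might look like:

```python
import numpy as np

def lightweight_coreset(X, m, seed=0):
    """Sample a weighted coreset of m points from X (n x d).

    Pass 1 computes the data mean and squared distances to it;
    pass 2 samples points with probability mixing a uniform term
    and the normalized squared distances. Inverse-probability
    weights keep weighted sums unbiased.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    dists = np.sum((X - X.mean(axis=0)) ** 2, axis=1)   # pass 1
    q = 0.5 / n + 0.5 * dists / dists.sum()             # sampling distribution
    idx = rng.choice(n, size=m, replace=True, p=q)      # pass 2
    w = 1.0 / (m * q[idx])                              # importance weights
    return X[idx], w

X = np.random.default_rng(1).normal(size=(1000, 2))
C, w = lightweight_coreset(X, 100)
```

Running the full solver on the weighted pair `(C, w)` instead of `X` is then what yields the speedup.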
Geometric Uncertainty for Detecting and Correcting Hallucinations in LLMs
Phillips, Edward, Wu, Sean, Molaei, Soheila, Belgrave, Danielle, Thakur, Anshul, Clifton, David
Large language models demonstrate impressive results across diverse tasks but are still known to hallucinate, generating linguistically plausible but incorrect answers to questions. Uncertainty quantification has been proposed as a strategy for hallucination detection, requiring estimates for both global uncertainty (attributed to a batch of responses) and local uncertainty (attributed to individual responses). While recent black-box approaches have shown some success, they often rely on disjoint heuristics or graph-theoretic approximations that lack a unified geometric interpretation. We introduce a geometric framework to address this, based on archetypal analysis of batches of responses sampled with only black-box model access. At the global level, we propose Geometric Volume, which measures the convex hull volume of archetypes derived from response embeddings. At the local level, we propose Geometric Suspicion, which leverages the spatial relationship between responses and these archetypes to rank reliability, enabling hallucination reduction through preferential response selection. Unlike prior methods that rely on discrete pairwise comparisons, our approach provides continuous semantic boundary points which have utility for attributing reliability to individual responses. Experiments show that our framework performs comparably to or better than prior methods on short-form question-answering datasets, and achieves superior results on medical datasets where hallucinations carry particularly critical risks. We also provide theoretical justification by proving a link between convex hull volume and entropy.

Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks (Guo et al., 2025; Anthropic, 2025; Gemini Team, Google DeepMind, 2025; OpenAI, 2025) and are increasingly applied in areas such as medical diagnosis, law, and financial advice (Yang et al., 2025; Chen et al., 2024; Kong et al., 2024).
Hallucinations, however, where models generate plausible but false or fabricated content, pose significant risks for adoption in high-stakes applications (Farquhar et al., 2024). Recent work, for example, finds GPT-4 hallucinating in 28.6% of reference generation tasks (Chelli et al., 2024).
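The Geometric Volume idea — measuring the convex hull volume spanned by archetypes of response embeddings — can be sketched in two dimensions, where the hull volume is a polygon area computed by the shoelace formula. This is a minimal sketch under our own assumptions (2-D embeddings, archetypes already extreme so sorting by angle around their centroid traces the hull boundary), not the paper's implementation:

```python
import numpy as np

def geometric_volume_2d(archetypes):
    """Area of the polygon spanned by 2-D archetype points.

    A larger spread of archetypes (more diverse responses) gives a
    larger hull area, i.e. higher global uncertainty in this sketch.
    """
    A = np.asarray(archetypes, dtype=float)
    c = A.mean(axis=0)
    # order the archetypes counter-clockwise around their centroid
    order = np.argsort(np.arctan2(A[:, 1] - c[1], A[:, 0] - c[0]))
    x, y = A[order, 0], A[order, 1]
    # shoelace formula for polygon area
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(geometric_volume_2d(square))  # 1.0
```

In higher-dimensional embedding spaces one would use a general convex hull routine (e.g. `scipy.spatial.ConvexHull`) rather than the shoelace formula.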
Incorporating Fairness Constraints into Archetypal Analysis
Alcacer, Aleix, Epifanio, Irene
Archetypal Analysis (AA) is an unsupervised learning method that represents data as convex combinations of extreme patterns called archetypes. While AA provides interpretable and low-dimensional representations, it can inadvertently encode sensitive attributes, leading to fairness concerns. In this work, we propose Fair Archetypal Analysis (FairAA), a modified formulation that explicitly reduces the influence of sensitive group information in the learned projections. We also introduce FairKernelAA, a nonlinear extension that addresses fairness in more complex data distributions. Our approach incorporates a fairness regularization term while preserving the structure and interpretability of the archetypes. We evaluate FairAA and FairKernelAA on synthetic datasets, including linear, nonlinear, and multi-group scenarios, demonstrating their ability to reduce group separability -- as measured by maximum mean discrepancy and linear separability -- without substantially compromising explained variance. We further validate our methods on the real-world ANSUR I dataset, confirming their robustness and practical utility. The results show that FairAA achieves a favorable trade-off between utility and fairness, making it a promising tool for responsible representation learning in sensitive applications.
A Survey on Archetypal Analysis
Alcacer, Aleix, Epifanio, Irene, Mair, Sebastian, Mørup, Morten
Archetypal analysis (AA) was originally proposed in 1994 by Adele Cutler and Leo Breiman as a computational procedure to extract the distinct aspects called archetypes in observations with each observational record approximated as a mixture (i.e., convex combination) of these archetypes. AA thereby provides straightforward, interpretable, and explainable representations for feature extraction and dimensionality reduction, facilitating the understanding of the structure of high-dimensional data with wide applications throughout the sciences. However, AA also faces challenges, particularly as the associated optimization problem is non-convex. This survey provides researchers and data mining practitioners an overview of methodologies and opportunities that AA has to offer surveying the many applications of AA across disparate fields of science, as well as best practices for modeling data using AA and limitations. The survey concludes by explaining important future research directions concerning AA.
Archetypal Analysis for Binary Data
Wedenborg, A. Emilie J., Mørup, Morten
Archetypal analysis (AA) is a matrix decomposition method that identifies distinct patterns using convex combinations of the data points denoted archetypes with each data point in turn reconstructed as convex combinations of the archetypes. AA thereby forms a polytope representing trade-offs of the distinct aspects in the data. Most existing methods for AA are designed for continuous data and do not exploit the structure of the data distribution. In this paper, we propose two new optimization frameworks for archetypal analysis for binary data. i) A second order approximation of the AA likelihood based on the Bernoulli distribution with efficient closed-form updates using an active set procedure for learning the convex combinations defining the archetypes, and a sequential minimal optimization strategy for learning the observation specific reconstructions. ii) A Bernoulli likelihood based version of the principal convex hull analysis (PCHA) algorithm originally developed for least squares optimization. We compare these approaches with the only existing binary AA procedure relying on multiplicative updates and demonstrate their superiority on both synthetic and real binary data. Notably, the proposed optimization frameworks for AA can easily be extended to other data distributions providing generic efficient optimization frameworks for AA based on tailored likelihood functions reflecting the underlying data distribution.
Reviews: Coresets for Archetypal Analysis
This paper looks at the problem of archetypal analysis -- which is effectively a low-rank representation of the data that is perhaps more interpretable. Instead of finding a low-rank subspace to represent the data, we try to represent each data point as a projection onto the convex hull of k points, where the k points themselves are convex combinations of the original data. The authors present a sampling-based method to create coresets for this problem. The main intuition is that the objective function is close to (in fact upper bounded by) a k-means objective; only the "query set" has changed. Given the strong coreset guarantee, a restriction of the query set means that the existing guarantees carry over.
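The upper-bound intuition can be checked numerically in one dimension, where projecting onto the convex hull of two archetypes is just clipping to a segment: snapping each point to its nearest archetype (the k-means quantization) is one feasible convex combination, so the best convex combination can only do better. A toy check, not the paper's proof:

```python
import numpy as np

# Two archetypes in 1-D: the convex hull is the segment [z0, z1].
z0, z1 = 0.0, 1.0
x = np.array([-0.5, 0.3, 0.7, 1.4])

# k-means style quantization error: snap to the nearest archetype.
quant = np.minimum((x - z0) ** 2, (x - z1) ** 2)

# AA style reconstruction error: project onto the segment
# (the best convex combination of the two archetypes).
proj = np.clip(x, z0, z1)
aa = (x - proj) ** 2

print(aa.sum() <= quant.sum())  # True: quantization error upper bounds AA error
```

Points inside the segment incur zero AA error but nonzero quantization error, which is exactly where the bound is loose.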